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Abstract 



0< ^-minimization refers to finding the minimum £i-norm solution to an underdetermined 

linear system b — Ax. It has recently received much attention, mainly motivated by 
^_^ the new compressive sensing theory that shows under quite general conditions the 

^^ minimum i?i-norm solution is also the sparsest solution to the system of linear equa- 

Utions. Although the underlying problem is a linear program, conventional algorithms 
such as interior-point methods suffer from poor scalability for large-scale real world 
(X3 problems. A number of accelerated algorithms have been recently proposed that take 

advantage of the special structure of the i\ -minimization problem. In this paper, we 
provide a comprehensive review of five representative approaches, namely, Gradient 
£X| Projection, Homotopy, Iterative Shrinkage- Thresholding, Proximal Gradient, and Aug- 

^ mented Lagrange Multiplier. The work is intended to fill in a gap in the existing 

C*") literature to systematically benchmark the performance of these algorithms using a 

l/~) consistent experimental setting. In particular, the paper will focus on a recently pro- 

posed face recognition algorithm, where a sparse representation framework has been 
used to recover human identities from facial images that may be affected by illumi- 
£■ — . nation, occlusion, and facial disguise. MATLAB implementations of the algorithms 

^^ described in this paper have been made publicly available. 
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1 Introduction 

Compressive sensing (CS) has been one of the hot topics in the signal process- 
ing and optimization communities in the last five years or so. In CS theory 
[11, 19, 10, 13], it has been shown that the minimum £i-norm solution to an 
underdetermined system of linear equations is also the sparsest possible solu- 
tion under quite general conditions. More specifically, suppose there exists an 
unknown signal Xq £ K", a measurement vector b £ WL d (d < n), and a mea- 
surement matrix A E R dxn such that A is full rank and b = Axq. Recovering 
Xq given A and b constitutes a non-trivial linear inversion problem, since the 
number of measurements in b is smaller than the number of unknowns in Xq. A 
conventional solution to this problem is the linear least squares, which finds the 
minimum ^2-norm solution (or the solution of least energy) to this system. How- 
ever, if Xq is sufficiently sparse 1 and the sensing matrix A is incoherent with the 
basis under which Xq is sparse (i.e., the identity matrix in the standard form), 
then Xq can be exactly recovered by computing the minimum i?i-norm solution, 
as given by the following optimization problem: 

(Pi) : min ||x||i subj. to b = Ax. (1) 

X 

We refer to the above problem as ^-minimization or ^i-min. The sparsity- 
seeking property of (Pi) has been shown to have applications in geophysics, 
data compression, image processing, sensor networks, and more recently, in 
computer vision. The interested reader is referred to [11, 2, 10, 48, 52] for a 
comprehensive review of these applications. 

The (Pi) problem can be recast as a linear program (LP) and can be solved 
by conventional methods such as interior-point methods. However, the com- 
putational complexity of these general-purpose algorithms is often too high for 
many real-world, large-scale applications. Alternatively, heuristic greedy algo- 
rithms have been developed to approximate (Pi). Orthogonal matching pursuit 
(OMP) [17] and least angle regression (LARS) [22] are two examples in this cat- 
egory. Empirically, these greedy algorithms work better when Xq is very sparse, 
but will deviate from the solution of (Pi) when the number of non-zero entries 
in x increases, as illustrated in [42]. In other words, the greedy methods do 
not come with strong theoretical guarantees for global convergence. 

Besides scalability, another important requirement for real-world applica- 
tions is robustness to noise, namely, the observation vector b may be corrupted 
by data noise. To take into account of the noise, one can relax the equality 
constraint as follows: 

(Pi, 2 ): min||x||i subj. to \\b - Ax\\ 2 < e, (2) 

X 

where e > is a pre-determined noise level. If the observation vector b is 
assumed to be corrupted by white noise of magnitude up to e, then the ground- 
truth signal Xq can be well approximated by (P1.2), dubbed basis pursuit de- 
noismg (BPDN) in [13, 12]. 



1 By sparse, we mean that most entries in the vector are zero. 
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Based on the context of the data noise, the ^2-norm used in the penalty term 
can also be replaced by other f p -norms. In this paper, we also focus on a special 
case of (Pi) given by: 

(Pi i) : min ||a;||i subj. to ||6 — Ax\\i < e. (3) 

X 

This formulation has been used in [49, 47] for robust face recognition. Although 
(Pl,i) can be solved by any algorithm designed for (Pi), as we will later explain, 
its special structure can be further exploited to develop more scalable methods. 

In light of the high interest in finding more efficient algorithms to solve these 
problems, many new algorithms have been proposed. Although it is impossible 
to summarize all existing algorithms in the literature, in this paper, we provide 
a comprehensive review of five representative methods, namely, Gradient Pro- 
jection (GP) [23, 31], Homotopy [41, 34, 21], Iterative Shrinkage- Thresholding 
(1ST) [16, 14, 26, 50], Proximal Gradient (PG) [38, 39, 5, 6], and Augmented 
Lagrange Multiplier (ALM) [8, 53]. Although the paper is mainly focused on 
fast implementations of ^i-min, the reader may refer to [10, 45] for a broader dis- 
cussion about recovering sparse solutions via other approaches, such as greedy 
pursuit- type algorithms. 

This paper intends to fill in a gap in the existing literature to systematically 
benchmark the performance of these algorithms using a fair and consistent ex- 
perimental setting. Due to the attention given to compressive sensing and i\- 
minimization, other more sophisticated solutions continue to be developed at a 
rapid pace. More recent developments include subspace pursuit [15], CoSaMP 
[37], approximate message-passing algorithm [20], and Bregman iterative algo- 
rithm [54], to name a few. However, we do not believe there exists an overall 
winner that could achieve the best performance in terms of both speed and ac- 
curacy for all applications. Therefore, in addition to extensive simulations using 
synthetic data, our experiments will be focused on a specific application of ro- 
bust face recognition proposed in [49] , where a sparse representation framework 
has recently been developed to recognize human identities from facial images. 
To aid peer evaluation, all algorithms discussed in this paper have been made 
available on our website as a MATLAB toolbox: 

http : //www . eecs . berkeley . edu/~yang/sof tware/llbenchmark/ 

1.1 Notation 

For a vector x <G R™, we denote by x + and x^ the vectors that collect the 
positive and negative coefficients of x, respectively: 

x = x + — x_, x + > 0,a;_ > 0. (4) 

We also denote 

X = diag(x 1 ,a: 2 ,--- ,x„)eIR Tlx " (5) 

as a square matrix with the coefficients of x as its diagonal and zero other- 
wise. The concatenation of two (column) vectors will be written following the 
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MATLAB convention: [jci;a;2] = [xl]', [^l,^] = [ Xl X2 ]- We denote by 1 a 
vector whose components are all one with dimension defined within the con- 
text. We represent the Euclidean or ^2-norm by 1 1 - 1 1 2 and the £i-norm by || • ||i. 
The notation || • || represents the ^ 2 - n orm for vectors and the spectral norm for 
matrices. 

For any real- valued differentiable function /(•), we denote its gradient by 
V/(-). If a real-valued convex function g(-) is not differentiable everywhere, we 
represent its subgradient by dg(-), defined as follows: 

dg(x) = {rjeR n : g(x) - g{x) > rf{x - x),\/x € K n }. (6) 

1.2 Primal-Dual Interior-Point Methods 

We first discuss a classical solution to the £i-min problem (Pi), called the primal- 
dual interior-point method, which is usually attributed to the works of [24, 29, 
35, 36, 32]. For the sake of simplicity, we assume here that the sparse solution 
x is nonnegative. 2 Under this assumption, it is easy to see that (Pi) can be 
converted to the standard primal and dual forms in linear programming: 



(7) 
z = c 



where for £i-min, c = 1. The primal-dual algorithm simultaneously solves for 
the primal and dual optimal solutions [9]. 

It was proposed in [24] that (P) can be converted to a family of logarithmic 
barrier problems 3 : 

(Pfj) : min x c T x - fJ.J27=i lo S x i- (g\ 

subj. to Ax = b, x > 

Clearly, a feasible solution x to (P M ) cannot have zero coefficients. Therefore, 
we define the interiors of the solution domains for (P) and (D) as: 

P ++ = {x : Ax = b, x > 0}, 

D ++ = {(y,z):A T y + z = c,z>0}, (9) 

S++ = P++xP ++ . 

Assuming that the above sets are non-empty, it can be shown that (P M ) 
has a unique global optimal solution x(/i) for all /j, > 0. As /i — > 0, x(fi) and 
(y(p), z(/j,)) converge to optimal solutions of problems (P) and (D) respectively 
[35, 36]. 





Primal (P) 




Dual 


min x 


T 

c x 


max y>z 


b r y 


subj. to 


Ax = b 


subj. to 


A T y^ 




x>0 




z > 0, 



2 This constraint can be easily removed by considering the linear system b = 
[A, — j4][aj_|_;x_], where [x+;a:_] is also nonnegative. 

3 In general, any smooth function *P that satisfies \l/(0 + ) = —00 is a valid barrier function 

[27]. 
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The primal-dual interior-point algorithm seeks the domain of the central 
trajectory for the problems (P) and (D) in S++ , where the central trajectory is 
defined as the set S = {(x(fi), y(/i), z(p)) : fi > 0} of solutions to the following 
system of equations: 

XZ1 = fil, Ax = b, A T y + z = c, , . 

x>0,z>0. ^ ' 

The above condition is also known as the Karush-Kuhn- Tucker (KKT) condi- 
tions for the convex program (P M ) [36, 32]. 

Hence, the update rule on the current value (x^ k \y^ ,z^ k ') is defined by 
the Newton direction (Ax, Ay, Az), which is computed as the solution to the 
following set of linear equations: 

Z^Ax + X^Az = frl-X^zW, 

AAx = 0, (11) 

A T Ay + Az= 0, 

where fi is a penalty parameter that is generally different from \x in (P^). 

In addition to the update rule (11), an algorithm also needs to specify a 
stopping criterion when the solution is close to the optimum. For ^i-min, some 
simple rules can be easily evaluated: 

1. The relative change of the sparse support set becomes small; 

2. The relative change (in the sense of the £ 2 - n orm) of the update of the 
estimate becomes small; 

3. The relative change of the objective function becomes small. 

A more detailed discussion about choosing good stopping criteria in different 
applications is postponed to Section 3. 

Algorithm 1 summarizes a conceptual implementation of the interior-point 
methods. 4 For more details about how to choose the initial values (x^ , y^ , z^ ) 
and the penalty parameter jx, the reader is referred to [32, 36]. Algorithm 1 re- 
quires a total of 0(\/n) iterations, and each iteration can be executed in 0(n 3 ) 
operations for solving the linear system (11). 

In one simulation shown in Figure 1, the computational complexity of Al- 
gorithm 1 with respect to (w.r.t.) the sensing dimension d grows much faster 
than the other five algorithms in comparison. For example, at d — 1900, the 
fastest algorithm in this simulation, i.e., DALM, only takes about 0.08 sec to 
complete one trial, which is more than 250 times faster than PDIPA. For this 
reason, basic BP algorithms should be used with caution in solving real-world 
applications. The details of the simulation are explained later in Section 3. 



4 There are multiple versions of the primal-dual interior-point solver implemented in MAT- 
LAB. Notable examples include SparseLab at http://sparselab.stanford.edu/, the CVX en- 
vironment at http://cvxr.com/cvx/, and the ^imagic package at http://www.acm. caltech. 
edu/llmagic/. 
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Algorithm 1 Primal-Dual Interior-Point Algorithm (PDIPA) 



Input: A full rank matrix A € 



pdxn 



( x (o) )?/ (o) )Z (0))_ Iteration k <- 
< 5 < s/n. 
1: repeat 

k <- k + 1, n ■<- /i(l - S/y/n). 
Solve (11) for (Ace, Ay, As), 
^(fc) ^_ ^(fc-i) + Ace, y(*) <- yC " 1 
until stopping criterion is satisfied. 



d < n, a vector b G R , initialization 
Initial penalty /x and a decreasing factor 



Output: x* <— x 



<» 



Ay,z^ -^^C - 1 ) + Az. 




*---#= =#= = » = =*==» --* — i 

1800 1400 1600 1000 S000 

Projection Dimension 



Average run time of PDIPA in comparison with five other fast algorithms. 
The simulation setup: n = 2000, k — 200. The projection matrices are 
randomly generated based on the standard normal distribution with the 
dimension varies from 300 to 1900. The support of the ground truth 
Xq is randomly selected at each trial, and the nonzero coefficients are 
sampled from the normal distribution. 



Next, we will review the five fast ^i-min algorithms shown in Figure 1, 
namely, Gradient Projection in Section 2.1, Homotopy in Section 2.2, Iterative 
Shrinkage-Thresholding in Section 2.3, Proximal Gradient in Section 2.4, and 
Augmented Lagrange Multiplier in Section 2.5. The algorithms of L1LS, Ho- 
motopy, and SpaRSA are provided by their respective authors. We have also 
provided our implementation of FISTA and DALM on our website. 



2 Fast £i-Min Algorithms 

2.1 Gradient Projection Methods 

We first discuss Gradient Projection (GP) methods that seek a sparse repre- 
sentation x along a certain gradient direction, which induces much faster con- 
vergence speed. The approach reformulates the ^i-min problem as a quadratic 
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programming (QP) problem. 

We start with the £i-min problem (-Pi, 2)- It is equivalent to the so-called 
LASSO problem [44]: 

(LASSO) : mm\\b-Ax\\l subj. to ||aj||i < o", (12) 

X 

where a > is an appropriately chosen constant. Using a Lagrangian for- 
mulation, the problem (LASSO) (and hence, (Pi. 2)) can be rewritten as an 
unconstrained optimization problem: 



-5' 

x Z 



x* = argminP(a;) = argmin -||6 — Ax\\\ + A||a~||i, (13) 



where A is the Lagrangian multiplier. 

In the literature, there exist two slightly different methods that formu- 
late (13) as a quadratic programming problem, namely, gradient projection 
sparse representation (GPSR) [23] and truncated Newton interior-point method 
(TNIPM) [31]. 5 

To formulate the GPSR algorithm, we separate the positive coefficients x + 
and the negative coefficients a;_ in x, and rewrite (13) as 



min x Q(x) = |||6- [A, -A][x+; X-]\\% + Al T (a; + + 
subj. to x + > 0, a;_ > 0. 

Problem (14) can be rewritten in the standard QP form as 

min Q(z) = c T z + \z T Bz 
subj. to z > 0, 



where z = [x+;x_], c — Al + [— A T b;A T b], and 

B = 



A T A -A T A 
-A T A A T A 



(14) 



(15) 



(16) 



We note that the gradient of Q(z) is defined as 

W z Q(z)=c + Bz. (17) 

This leads to a steepest-descent algorithm that searches from each iterate z^ 
along the negative gradient — WQ(z): 

z (W) = ^)_ a (*)VQ( z (')), (18) 

where a' ' is the step size. This can be solved by a standard line-search process 
[28]. For example, in [23], the direction vector g( k > is defined as 

„(*) = /(VQ(2 (t) ))i, ifzf ) >0or(VQ( 2 W)) 4 <0 nq . 

y * I 0, otherwise. l ; 



5 A MATLAB implementation of GPSR is available at http : //www . lx . it . pt/~mtf /GPSR. A 
MATLAB Toolbox for TNIPM called L1LS is available at http://www.stanford.edu/~boyd/ 
ll_ls/. 
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Then the step size for the update is chosen to be 

a {k) = argminQ(z (fe) - ag {k) ), (20) 

a 

which has a closed-form solution 

,,, (o( fe )) T g( fe) 

,y(k) _ _w !_y (-911 

(g( k )) T Bg( k )' l ' 

The computational complexity and convergence of GPSR is difficult to es- 
timate exactly [23]. Another issue is that the formulation of (15) doubles the 
dimension of the equations from (13). Therefore, the matrix operations involv- 
ing B must take into account its special structure w.r.t. A and A T . 

The second GP algorithm, which we will benchmark in Section 3, is truncated 
Newton interior-point method (TNIPM) [31]. It transforms the same objective 
function (13) to a quadratic program but with inequality constraints: 

min IWAx-bWl + XY^^Ui. ^ 

subj. to — m < Xi < Ui, i = 1, • • • , n 

Then a logarithmic barrier function for the constraints — m < Xi < Ui is con- 
structed as follows [24]: 

$(x, u) = ~y] log(«i + x{] - ^ \og(ui - Xi). (23) 

i i 

Over the domain of (x,w), the central path consists of the unique niinimizer 
(x* (t) , u* (t)) of the convex function 

n 

F t (x, u) = t(\\Ax - b\\l + A^u 4 ) + $(x, u), (24) 

i=\ 

where the parameter t € [0,oo). 

Using the primal barrier method discussed in Section 1.2, the optimal search 
direction using Newton's method is computed by 



V 2 F t (x,u) 



Ax 

Am 



Vftf^ujer". (25) 



Again, for large-scale problems, directly solving (25) is computationally ex- 
pensive. In [31], the search step is accelerated by a preconditioned conjugate 
gradients (PCG) algorithm, where an efficient preconditioner is proposed to ap- 
proximate the Hessian of |||Aa; — 6|||. The reader is referred to [30, 40] for more 
details about PCG. 

2.2 Homotopy Methods 

One of the drawbacks of the PDIPA method is that they require the solution 
sequence x(/j,) to be close to a "central path" as fi — > 0, which sometimes is 
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difficult to satisfy and computationally expensive in practice. In this section, 
we review an approach called Homotopy methods [41, 34, 21] that can mitigate 
these issues. 

We recall that (-Pi ,2) can be written as an unconstrained convex optimization 
problem: 

x* = arg min x F(x) = argminj. i||6 - Ax\\% + X\\x\\i, , . 

= arg min x f(x) + Xg(x) 

where f(x) = |||6 — Aa;|||, g{x) — \\x\\i, and A > is the Lagrange multiplier. 
On one hand, w.r.t. a fixed A, the optimal solution is achieved when G dF{x). 
On the other hand, similar to the interior-point algorithm, if we define 

X = {x* x :XG[0,oo)}, (27) 

X identifies a solution path that follows the change in A: when A — > 00, x* x = 0; 
when A — > 0, x x converges to the solution of (Pi). 

The Homotopy methods exploit the fact that the objective function F(x) 
undergoes a homotopy from the £2 constraint to the t\ objective in (26) as A 
decreases. One can further show that the solution path X is piece-wise constant 
as a function of A [41, 22, 21]. Therefore, in constructing a decreasing sequence 
of A, it is only necessary to identify those "breakpoints" that lead to changes 
of the support set of x* x , namely, either a new nonzero coefficient added or a 
previous nonzero coefficient removed. 

The algorithm operates in an iterative fashion with an initial value x*- ' = 0. 
In each iteration, given a nonzero A, we solve for x satisfying dF(x) = 0. The 
first summand / in (26) is differentiable: V/ = A T (Ax — b) = — c{x). The 
subgradient of g (x) = \\x\\\ is given by: 

«(«o=»imii = {«€r " : ; e [_i,i],w=o}- ( 28 ) 

Thus, the solution to dF(x) = is also the solution to the following equation: 
c(x) = A T b- A T Ax = Xu(x). (29) 

By the definition (28), the sparse support set at each iteration is given by 

l={i:\cf ) \=X}. (30) 

The algorithm then computes the update for x^ ' in terms of the direction 
and the magnitude separately. Specifically, the update direction on the sparse 
support d ' (I) is the solution to the following system: 

^Axd( fe )(I)=sgn(c( fc )(Z)), (31) 

where Ax is a submatrix of A that collects the column vectors of A w.r.t. T, 
and c( k \l) is a vector that contains the coefficients of c^ w.r.t. X. For the 
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coefficients whose indices are not in I, their update directions are manually set 
to zero. Along the direction indicated by d> ' , there are two scenarios when an 
update on x may lead to a breakpoint where the condition (29) is violated. The 
first scenario occurs when an element of c not in the support set would increase 
in magnitude beyond A: 

+ _ • j A-Cj X + c l \ 

7 i«ff\l-aT^ x d( fc )(I)'l + oTAxd( fc )(I)J' l ' 

The index that achieves 7 + is denoted as i + . The second scenario occurs when 
an element of c in the support set T crosses zero, violating the sign agreement: 

7~ = mm{-Xi/di}. (33) 

The index that achieves 7~ is denoted as i~ . Hence, the homotopy algorithm 
marches to the next breakpoint, and updates the sparse support set by either 
appending X with i + or removing i~ : 

x (k+i) =aj (fc) +m i n { 7 +, 7 -}dW. (34) 

The algorithm terminates when the relative change in x between consecutive 
iterations is sufficiently small. Algorithm 2 summarizes an implementation of 
the Homotopy methods. 6 

Algorithm 2 Homotopy 

Input: A full rank matrix A = [«i,-"" i v n\ £ M dx ™, 

d < n, a vector b e R d , initial Lagrangian parameter A = 
2p T 6|U- 
1: Initialization: fc <— 0. Find the first support index: i — argmax^ =1 ||v^b||, 

2: repeat 

3: fc «- fc + 1. 

4: Solve for the update direction d} ' in (31). 

5: Compute the sparse support updates (32) and (33): 7* <— min{7 + ,7~}. 
Update x^ k \ I, and A <— A — 7*. 
until stopping criterion is satisfied. 
Output: x* (- x^ k \ 

Overall, solving (31) using a Cholesky factorization and the addition/removal 
of the sparse support elements dominate the computation. Since one can keep 
track of the rank-1 update of A^Az in solving (31) using 0(d 2 ) operations 
in each iteration, the computational complexity of the homotopy algorithm is 
0{kd 2 + kdn). 



6 A MATLAB implementation [1] can be found at http://users.ece.gatech.edu/~sasif/ 
homotopy/. 
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Finally, we want to point out that Homotopy has been shown to share some 
connections with two greedy £i-min approximations, namely, least angle regres- 
sion (LARS) [22] and polytope faces pursuit (PFP) [42]. For instance, if the 
ground-truth signal x has at most k non-zero components with k << n, all 
three algorithms can recover it in k iterations. On the other hand, LARS never 
removes indices from the sparse support set during the iteration, while Homo- 
topy and PFP have mechanisms to remove coefficients from the sparse support. 
More importantly, Homotopy provably solves £i-min (Pi), while LARS and PFP 
are only approximate solutions. A more detailed comparison between Homo- 
topy, LARS, and PFP can be found in [21]. 

2.3 Iterative Shrinkage-Thresholding Methods 

Although Homotopy employs a more efficient iterative update rule that only 
involves operations on those submatrices of A corresponding to the support sets 
of x, it may not be as efficient when the sparsity k and the observation dimension 
d grow proportionally with the signal dimension n. In such scenarios, one can 
show that the worst-case computational complexity is still bounded by 0(n 3 ). 
In this section, we discuss Iterative Shrinkage-Thresholding (1ST) methods [16, 
14, 26, 50], whose implementation mainly involves simple operations such as 
vector algebra and matrix-vector multiplications. This is in contrast to most 
past methods that all involve expensive operations such as matrix factorization 
and solving linear least squares problems. A short survey on the applications of 
1ST can be found in [54] . 

In a nutshell, 1ST considers solving (Pi, 2) as a special case of the following 
composite objective function: 

minima;) = f(x) + \g(x), (35) 

X 

where / : R n — >■ R is a smooth and convex function, and g : R n — > R as 
the regularization term is bounded from below but not necessarily smooth nor 
convex. For £i-min in particular, g is also separable, that is, 

n 

9( x ) =^9i{Xi). (36) 

Clearly, let f(x) = |||6— Ax\\\ and g{x) = \\x\\\. Then the objective function 
(35) becomes the unconstrained BPDN problem. 

The update rule to minimize (35) is computed using a second-order approx- 
imation of / [50, 5[: 

x (k+i) = wzm m x {f{x^) + {x-x^) T Wf{x^) 
+ \\\x - x^\\l • V 2 /(* (fe) ) + A 5 (a;)} 
s=s argmin a .{(x — x^) T V f(x^) 

+ ^\\x-x^\\l + \g{x)} 
= argmin 3; {i||a;-M(' c )||^ + ^ T g(a;)}, 
= G Q( ») (*<*>), 



(37) 
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where 

u (k) = x (k) _ _L^a)i ( 38 ) 

en > 

In (37), the Hessian V 2 f(x^) is approximated by a diagonal matrix a^I. 

If we replace g(x) in (37) by the ^-norm ||a;||i, which is aseparable function, 
then G a (k)(x^) has a closed-form solution w.r.t. each component: 



(fe+i) • J (^i-«f) 2 , A|xi| 



x\ = argmm Xi 



,(/.-) 



(39) 



= soft( W f\^), 

where 

soft(w, a) = sgn(u) max{|u| — a, 0} 

r sgn(u)(|u| - a) if \u\ > a (40) 

otherwise 

is the soft-thresholding or shrinkage function [18]. 

There are two free parameters in (37), namely, the regularizing coefficient 
A and the coefficient a^ that approximates the hessian matrix V 2 /. Different 
strategies for choosing these parameters have been proposed. Since al mimics 
the Hessian V 2 /, we require that a< fe )(a;( fe ) - a;^- 1 )) w Vf(x^) - Vf{x {k -^) 
in the least-squares sense. Hence, 

0,0+1) _ argmino, ||a(a;( fc ) — a;^ -1 )) 

_(v/( :c ( fe ))-v/( :c ( fe - 1 )))||i , n 

(x< fc )-x< fc - 1 ») r (V/( J; < fc >)-V/(c t; "°- 1 ») V j 

(x( fc )-x( fc - 1 )) T (x( fc )-a;( fc - 1 )) 

This is known as the Barzilai-Borwein equation [3, 43, 50]. 

For choosing A, instead of using a fixed value, several papers have proposed a 
continuation strategy [26, 23], in which (37) is solved for a decreasing sequence 
of A. As mentioned in Section 2.2, (37) recovers the optimal £i-min solution 
when A — > 0. However, it has been observed that in practice the performance 
degrades by directly solving (37) for small values of A, which has been dubbed as 
a "cold" starting point. Instead, continuation employs a warm-starting strategy 
by first solving (37) for a larger value of A, then decreasing A in steps towards 
its desired value. 

The 1ST algorithm is summarized in Algorithm 3. 7 The interested reader is 
referred to [33] for guidelines on choosing good soft-thresholding parameters to 
improve accuracy. 

2.4 Proximal Gradient Methods 

Proximal Gradient (PG) algorithms represent another class of algorithms that 
solve convex optimization problems of the form (35). Assume that our cost 



7 A MATLAB implementation called Sparse Reconstruction by Separable Approximation 
(SpaRSA) [50] is available at http://www.lx.it.pt/~mtf/SpaRSA/. 
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Algorithm 3 Iterative Shrinkage-Thresholding (1ST) 

Input: A full rank matrix A e M dx ™, d < n, a vector 
b e R d , Lagrangian An, initial values for x^ and a , k <— 
0. 
1: Generate a reducing sequence An > Ai > • • • > A at. 
for i = 0, l,--- ,7V do 
A «- A 2 
repeat 
fc <- fc + 1. 

Update a^' using (41). 
until The objective function F(x^) decreases. 
end for 
Output: x* «- x^. 



function F(-) can be decomposed as the sum of two functions / and g, where / 
is a smooth convex function with Lipschitz continuous gradient, and g is a con- 
tinuous convex function. The principle behind these algorithms is to iteratively 
form quadratic approximations Q(x, y) to F(x) around a carefully chosen point 
y, and to minimize Q(x, y) rather than the original cost function F. 

For our problem, we define g(x) = \\x\\i and f(x) = ^\\Ax — &|| ^ - We 
note that V f(x) = A T (Ax — b) is Lipschitz continuous with Lipschitz constant 
L f = || A|| 2 . Define Q(x,y) as: 

Q(x,y) = f(y) + (Vf(y),x - y) + ^ \\x - yf +\g(x). (42) 

It can be shown that F(x) < Q(x,y) for all y, and 

argmin Q(x,y) = argmin < Xg(x) -\ ||a; — u\\ > , (43) 

X X I 2 j 

where u = y— y^V/(y) by the same trick used in (37). For the ^-min problem, 
(43) has a closed-form solution given by the soft-thresholding function: 

argmin Q(x,y) = soft I u, — I . (44) 

V L fJ 

However, unlike the iterative thresholding algorithm described earlier, we use 
a smoothed computation of the sequence y k . It has been shown that choosing 

tk V / 

where {£&} is a positive real sequence satisfying t\ — tk < t\_ x , achieves an 
accelerated non-asymptotic convergence rate of 0(k~ 2 ) [38, 39, 5]. To further 
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accelerate the convergence of the algorithm, one can also make use of the con- 
tinuation technique described in Section 2.3. 

Finally, for large problems, it is often computationally expensive to directly 
compute Lf — ||^4|| 2 . 8 A backtracking line-search strategy [5] can be used to 
generate a scalar sequence {L^} that approximates Lf. We define 

Q L (x,y) = f(y) + (x- y) T Vf(y) + |||aj - y\\ 2 + \g(x). (46) 

Suppose that r\ > 1 is a pre-defined constant. Then, given y^ at the fcth 
iteration, we set Lk = if £fc_i, where j is the smallest nonnegative integer such 
that the following inequality holds: 

F{G Lh {yM))<Q Lh {G Lk {yM),yM), (47) 

where G L (y) = argmin x Q L (x,y) = soft («, ^) for u = y - j-Wf(y). 

The algorithm, dubbed FISTA in [5], is summarized as Algorithm 4. 9 The 
convergence behavior of FISTA is given by 



F(x^)-F(x*)< ™* * " , Vfc. (48) 



2L f \\x^- -'^ 
(fc+1) 2 

The interested reader may refer to [39, 5, 6] for a proof of the above result. 

Algorithm 4 Fast Iterative Shrinkage-Threshold Algorithm (FISTA) 
Input: beR m , Ae W nxn . 

1: Set x(°) <- 0, X {1S > <- 0, t <- 1, *i <- 1, k <- 1. 

Initialize L , Ai, B £ (0. 1), A > 0. 
while not converged do 

,(fe) <- x (k) _|_ tk-l-1 / x (fe) _ ^(k-1)) 



y\") 4r- X^' + "•"* {X^> - X' 

Update Lk using (47) with y^ k \ 
u (k) ^ y (k) __L A T {AyW-b). 







l + A 


Z 4 *^ 1 




X 


ife+i 


< " 


2 




9 


Afc+i 


■(— max(/3 Afe 


A) 


10 


fc^ 


fc + 1. 






11 


end w 


hile 






Oi 


jtput: 


cc* <— 


a;( fe ). 





2.5 Augmented Lagrange Multiplier Methods 

Lagrange multiplier methods are a popular class of algorithms in convex pro- 
gramming. The basic idea is to eliminate equality constraints and instead add 



This problem occurs in the 1ST algorithm as well. 
9 An implementation of FISTA is provided on the website of this paper. Another Matlab 
toolbox called NESTA [6] is available at: http://www.acm.caltech.edu/~nesta/. 
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a penalty term to the cost function that assigns a very high cost to infeasi- 
ble points. Augmented Lagrange Multiplier (ALM) methods differ from other 
penalty-based approaches by simultaneously estimating the optimal solution 
and Lagrange multipliers in an iterative fashion. 

We consider the general ^-min problem (1) with the optimal solution x* . 
The corresponding augmented Lagrange function is given by 

L^x,y) = \\x\U + (y,b-Ax) + ^\\b- Ax\\ 2 2 , (49) 

where fi > is a constant that determines the penalty for infeasibility, and 
y is a vector of Lagrange multipliers. Let y* be a Lagrange multiplier vector 
satisfying the second-order sufficiency conditions for optimality (see section 3.2 
in [8] for more details). Then, for sufficiently large /i, it can be shown that 

x* = argmin Ljx,y*). (50) 

X 

The main problem with this solution is that y* is unknown in general. Fur- 
thermore, the choice of [i is not straightforward from the above formulation. It 
is clear that to compute x* by minimizing L^(x,y), we must choose y close 
to y* and set /i to be a very large positive constant. The following iterative 
procedure has been proposed in [8] to simultaneously compute y* and x*: 

x k+1 = argmin x L^ k (x,y k ) ,_ 

Vk+i = Vk+M b - Ax k+1 ) ' 

where {^k} is a monotonically increasing positive sequence. We note that the 
first step in the above procedure is itself an unconstrained convex optimization 
problem. Thus, the above iterative procedure is computationally efficient only if 
it is easier to minimize the augmented Lagrangian function compared to solving 
the the original constrained optimization problem (1) directly. 

We focus our attention on solving the first step of (51) for the £i-min prob- 
lem. Although it is not possible to obtain a closed- form solution, the cost 
function has the same form as described in (35). Furthermore, the quadratic 
penalty term is smooth and has a Lipschitz continuous gradient. Therefore, we 
can solve it efficiently using proximal gradient methods (e.g., FISTA) described 
in Section 2.4. 

The entire algorithm is summarized as Algorithm 5, where r represents the 
largest eigenvalue of the matrix A T A, and p > 1 is a constant. Although the 
steps in Algorithm 5 closely resemble those of Algorithm 4, we will later see 
that ALM in general is more accurate and converges faster. 

We would like to point out that a similar algorithm called Alternating Di- 
rections Method (ADM) [53] essentially shares the same idea as above. 10 The 
major difference is that ADM would approximate the solution to the first step 
of the ALM iteration in (51). This is done by computing only one iteration of 

10 A MATLAB toolbox of the ADM algorithm called YALL1 is provided at: http://yal.ll. 
blogs .rice . edu/. 
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Algorithm 5 Augmented Lagrange Multiplier (ALM) 

Input: beR m , Ae W nxn . 
1: while not converged (fc = 1,2,.. .) do 

2: ti <r- 1, z\ <- ajfe, tti «- a; fe 

3: while not converged (7 = 1, 2, . . .) do 

4: u, +1 <- soft (z, - \A T [A Zl b ±y k ) , ^ 

5: tj+i «- | (l + v/1 + At 2 



2 v ^ v x ^^"/ 

1-1 

(+1 



Zj+i <- M;+i + ^xr(«J+l - "0 



end while 

J/fe+i ^Vk + Vk(b - Ax k+1 ) 
Mfe+i ^ p- Hk 
end while 



Output: a;* <— CEfc. 



the FISTA algorithm. Although this approximation yields a gain in computa- 
tional speed, we have observed from experiments that the convergence behavior 
of ADM may vary depending on the distribution of the matrix A. 

Algorithm 5 summarizes the ALM method applied to the primal problem 
(1), which is referred to as Primal ALM (PALM) in this paper. Interestingly, 
the principles of ALM can also be applied to its dual problem [53] : 

maxb T y subj. to A T y£Bf, (52) 

v 

where Bj° = {x e R" : ||a;||oo < !}■ Under certain circumstances, this may 
result in a more computationally efficient algorithm. In the rest of the section, 
we will briefly explain the Dual ALM (DALM) algorithm. 

Simple computation shows that DALM solves the following problem: 

mi%, x , z -b T y-x T {z- A T y) + ^\\z- A T y\\ 2 2 , ■, 

subj. to z £ Bf . " [ } 

Here, x as the primal variable becomes the associated Lagrange multiplier in the 
dual space [53]. Since it is difficult to solve the above problem simultaneously 
w.r.t. y, x and z, we adopt a strategy by alternately updating the primal 
variable x and the dual variables y and z. 

On one hand, for x = x k and y = y k , the minimizer z k+ \ with respect to z 
is given by 

z k+1 =V Br (A T y k + x k /p), (54) 

where Vb°° represents the projection operator onto BJ°. On the other hand, 
given x = x k and z = Zfc+i, the minimization with respect to y is a least squares 
problem, whose solution is given by the solution to the following equation: 

(3AA T y = f3Az k+1 - (Ax k - b) . (55) 
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Suppose that AA T is invertible. Then, we directly use its inverse to solve (55). 
However, for large scale problems, this matrix inversion can be computationally 
expensive. Therefore, in such cases, we can approximate the solution with one 
step of the Conjugate Gradient algorithm in the y direction at each iteration, 
as proposed in [53]. 

The basic iteration of the DALM algorithm is given by 11 



z k +i = V A ^(A T y k + x k /P), 

y k+l = {AA T )-\Az k+l - (Ax k - b)/p), (56) 

fc+i)i 



x k+1 = ajfe - I3{z k+1 - A T y 



As we will see in Section 4, when the matrix AA T is easily invertible, it is 
possible to solve (55) exactly. Since all the subproblems now are solved exactly, 
the convergence of the dual algorithm is guaranteed. Furthermore, since its 
major computation lies in solving for the dual variable y, when the number of 
dual variables are much smaller than the number of primal variables (i.e., when 
d <C n), the dual algorithm could be more efficient than the primal algorithm. 

3 Simulation: Random Sparse Signals 

In this section, we present two sets of experiment to benchmark the perfor- 
mance of the five fast ^i-min algorithms on random sparse signals, namely, 
L1LS/TNIPM, Homotopy, SpaRSA/IST, FISTA, and DALM, together with 
the basic BP algorithm (i.e., PDIPA). All experiments are performed in MAT- 
LAB on a Dell PowerEdge 1900 workstation with dual quad-core 2.66GHz Xeon 
processors and 8GB of memory. 

Before we present the results, we need to caution that the performance of 
these algorithms is affected by multiple factors, including the algorithm imple- 
mentation, the chosen ad-hoc parameters, and even the MATLAB environment 
itself. One factor that we should pay special attention to is the stopping criteria 
used in benchmarking these algorithms. As we first mentioned in Section 1.2, 
choosing a good stopping criterion is important to properly exit an iteration 
when the estimate becomes close to a local or global optimum. On one hand, 
in general, straightforward rules do exist, such as the relative change of the 
objective function: 

\\F(xl k +»)-F(xW)\\ 

\\F(x( k ))\\ ' l ' 

or the relative change of the estimate: 

|| a .(fc+i)_ a .(fc)|| 

||xW|| ' (58) 

However, their efficacy depends on a proper step size of the update rule: If the 
step size is poorly chosen, the algorithm may terminate prematurely when the 



11 The PALM and DALM algorithms in MATLAB can be downloaded from the website of 
this paper. 
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estimate is still far away from the optimum. On the other hand, certain special 
criteria are more effective to some algorithms than the others. For example, for 
PDIPA, it is natural to use the (relative) duality gap between the primal and 
dual solutions; for Homotopy, it is easy to measure the change of the sparse 
support as the stopping criterion. 

In this section, since the experiment is performed on synthetic data, we use 
a simple threshold that compares the £2-norm difference between the ^i-min 
estimate x* and the ground truth Xq as the stopping criterion. 

3.1 p-5 Plot in the Noise-Free Case 

The first experiment is designed to measure the accuracy of the various algo- 
rithms in recovering sparse signals in the noise- free case (Pi). We evaluate each 
algorithm using a p-5 plot, where the sparsity rate p = k/n G (0,1] and the 
sampling rate 5 = d/n £ (0, 1]. At each 5, the percentages of successes that an 
£i-min algorithm finds the ground-truth solution Xq (with a very small toler- 
ance threshold) are measured over different p's. Then a fixed success rate, say 
of 95%, over all 5's can be interpolated as a curve in the p-5 plot. In general, 
the higher the success rates, the better an algorithm can recover dense signals 
in the (Pi) problem. 

Figure 2 shows the 95% success-rate curves for the six algorithms. In the 
simulation, the ambient dimension d = 1000 is fixed. For a fixed pair (p, 5), the 
support of the ground-truth signal x is chosen uniformly at random, and the 
values of the non-zero entries are drawn from the standard normal distribution. 
In addition, the vector is normalized to have unit £2-norm. The measurement 
matrix A is generated as a random Gaussian matrix, each of whose entries is i.i.d. 
randomly generated from the standard normal distribution and then normalized 
to have unit column norm. We choose to compute the average success rates over 
100 trials on the vertices of a grid of (p, 5) pairs for each of the ^i-min algorithms, 
and the coordinates of the 95% rate are interpolated from the vertex values. 



i 



- PDIPA 
-e-ULS 
-9- HOMOTOPY 

» SpaRSA 
•o ■ FISTA 
"O-DALM 




0.2 0.3 0-4 0-5 0.6 0.7 0.S 

Sampling Rate 



Fig. 2: The p-5 plot (in color) shows the 95% success-rate curves for the six fast 
£i-min algorithms. 
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The observations of the experiment are summarized below: 

1. Without concerns about speed and data noise, the success rates of the 
interior-point method PDIPA is the highest of all the algorithms compared 
in Figure 2, especially when the signal becomes dense. 

2. The accuracy for the remaining five algorithm is separated in three groups. 
In particular, the success rates of L1LS and Homotopy are similar w.r.t. 
different sparsity rates and sampling rates, and they are also very close 
to PDIPA. In the low sampling-rate regime, Homotopy is slightly better 
than L1LS. 

3. The success rates of FISTA and DALM are comparable over all sampling 
rates. In the noise-free case, they are not as accurate as the other exact 
£i-min solutions. However, their performance shows a significant improve- 
ment over the 1ST algorithm, namely, SpaRSA. 

3.2 Performance with Moderate Data Noise 

We are more interested in comparing the ^i-min algorithms when the measure- 
ment contains moderate amounts of data noise. In the second simulation, we 
rank the six algorithms under two scenarios: Firstly, we measure the perfor- 
mance in the low-sparsity regime, where the ambient dimension n = 2000 and 
the sparsity ratio p = k/n = 0.1 are fixed, and the dimension d of the Gaussian 
random projection varies from 800 to 1900. Secondly, we measure the perfor- 
mance when x becomes dense w.r.t. a fixed sampling rate, where n — 2000 and 
d = 1500 are fixed, and the sparsity ratio p = k/n varies from 0.1 to 0.26. The 
maximum number of iterations for all algorithms is limited to 5000. The results 
are shown in Figure 3 and 4, respectively. In both experiments, we corrupt the 
measurement vector b with e, an additive white noise term whose entries are 
i.i.d. according to a Gaussian distribution with zero mean and variance 0.01. 

Firstly, when a low sparsity ratio of p = 0.1 is fixed in Figure 3, ^i-min 
becomes better conditioned as the projection dimension increase. However, 
the computation cost also increases with the projection dimension. We then 
compare the performance of the six algorithm in Figure 3: 

1. The computational complexity of PDIPA grows much faster than the other 
fast algorithms. More importantly, in contrast to its performance in the 
noise-free case, the estimation error also grows exponentially, in which case 
the algorithm (i.e., the update rule (11)) fails to converge to an estimate 
that is close to the ground truth. 

2. As the projection dimension increases, the five fast ^i-min algorithms all 
converge to good approximate solutions. The estimation error of Homo- 
topy is slightly higher than the rest four algorithms. 

3. In terms of speed, L1LS and Homotopy take much longer time to converge 
than SpaRSA, FISTA, and DALM. 



A Review of Fast ^i-Minimization Algorithms for Robust Face Recognition 



20 




'*--*--♦--♦- -^ -*■==#' = * = =*"* = 
ioco isoo moo ia» 1000 
Projection Dimension 

(a) Average run time 















- • - SpaHSA 




0.35 




-e-FISTA 


„e--° 






-*-DALM 




S 0.3 


■Q* 


1 


jo- -& 


3 

EC 




« 0.2 

s 

^ 0.15' 


k — - * 


C.I 




< 

0.3. 




■I 


1M0 1200 1400 1600 iaoo ?oc 



Projection Dimension 
(b) Average run time (detail) 




1000 iaoo uoo 1600 isoo sooe 
Projection Dimension 



(c) Average ^2-norm error 



Fig. 3: Comparison of the six fast £i-min algorithms w.r.t. a fixed sparsity ratio 
(n = 2000, k — 200), and varying projection dimension d from 800 to 
1900. 
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Fig. 4: Comparison of the six fast £i-min algorithms w.r.t. a fixed sampling 
ratio (n = 2000, d — 1500), and varying sparsity ratio k/n from 0.1 to 
0.26. 



4. The average run time of DALM is the shortest over all projection dimen- 
sions, which makes it the best algorithm in this comparison. 
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Secondly, when the projection dimension d = 1500 is fixed in Figure 4, we 
compare both the average run time and the average estimation error when the 
sparsity varies: 

1. PDIPA significantly underperforms the rest five fast algorithms in terms 
of both accuracy and speed, consistent with the result in the previous 
simulation with noisy data. 

2. The average run time of Homotopy grows almost linearly with the sparsity 
ratio, while the other algorithms are relatively unaffected. Thus, Homo- 
topy is more suitable for scenarios where the unknown signal is expected 
to have a very small sparsity ratio. 

3. The computational cost of the rest four algorithms, namely, L1LS, SpaRSA, 
FISTA, and DALM, remains largely unchanged when the random projec- 
tion dimension d is fixed. 

4. DALM is the fastest algorithm in the low-sparsity regime, but its run 
time approaches that of the other first-order methods in the high-sparsity 
regime. Overall, DALM is the best algorithm in this scenario. 

To summarize, in the presence of low levels of Gaussian noise, PDIPA per- 
forms quite badly in comparison with the other ii-uun algorithms that employ 
more sophisticated techniques to handle noisy data. Perhaps a more surpris- 
ing observation is that when the coefficients of A are randomly drawn from a 
Gaussian distribution, under a broad range of simulation conditions, FISTA is 
not significantly better than the original 1ST algorithm. 12 However, in the next 
section, we will see that when the dictionary A is replaced by real-world data 
matrices that contain training images in face recognition, the performance of 
the six ^i-min algorithms would be dramatically different from synthetic data. 

4 Face Recognition Experiment I: Corruption and Disguise 
Compensation 

In this section, we benchmark the performance of the six algorithms in robust 
face recognition. The experiment is set up to estimate sparse representation 
of real face images based on the cross- and-bouquet (CAB) model introduced in 
[47]. 

More specifically, It has been known in face recognition that a well-aligned 
frontal face image b under different lighting and expression lies close to a special 
low-dimensional linear subspace spanned by the training samples from the same 
subject, called a face subspace [7, 4]: 

A i = [v itU v i ,2,--- ,v i>ni ]€R dxni , (59) 

12 We have observed that the performance of FISTA may vary widely depending on its chosen 
parameters. If the parameters are tuned at different noise level, its speed of convergence can 
be much improved compared to 1ST. However, for consistency, all algorithms in this simulation 
are assigned a fixed set of parameters. 
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where Vij represents the j-th training image from the i-th subject stacked in 
the vector form. Given C subjects and a new test image b (also in the vector 
form) , we seek the sparsest linear representation of the sample with respect to 
all training examples: 



b=[A 1 ,A 2 



,Ac][xi;x 2 ;--- ;x c ] = Ax, 



(60) 



where A £ R dx ™ collects all the training images. 

Clearly, if b is a valid test image, it must be associated with one of the 
C subjects. Therefore, the corresponding representation in (60) has a sparse 
representation x = [• • • ; 0; £C»; 0; ■ ■ ■ ]: on average only a fraction of i coefficients 
are nonzero, and the dominant nonzero coefficients in sparse representation x 
reveal the true subject class. 

In addition, we consider the situation where the query image b may be 
severely occluded or corrupted. The problem is modeled by a corrupted set of 
linear equations b = Ax + e, where e £ M d is an unknown vector whose nonzero 
entries correspond to the corrupted pixels. In [49], the authors proposed to 
estimate w = [x: e] jointly as the sparsest solution to the extended equation: 



min \\w\\i subject to b = [A, I]w. 



(61) 



The new dictionary [A, I] was dubbed a cross-and-bouquet model in the follow- 
ing sense. The columns of A are highly correlated, as the convex hull spanned by 
all face images of all subjects occupies an extremely tiny portion of the ambient 
space. These vectors are tightly bundled together as a "bouquet," whereas the 
vectors associated with the identity matrix and its negative ±7 form a standard 
as shown in Figure 5. 



cross m 





Highly coherent 

(volume < 1.5 x 10" 229 ) 



5: The cross-and-bouquet model for face recognition. The raw images of hu- 
man faces expressed as columns of A are clustered with very small variance. 
(Courtesy of John Wright [47]) 



In this section, the performance of the six ^-min algorithms using the CAB 
model is benchmarked on the CMU Multi-PIE face database [25]. A subset 
of 249 subjects from the data set (Session 1) is used for this experiment. Each 
subject is captured under 20 different illumination conditions with a frontal pose. 
The images are then manually aligned and cropped, and down-sampled to 40x30 
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pixels. Out of the 20 images for each subject, images {0,1,7,13,14,16,18} 
with extreme illumination conditions are chosen as the training images, and the 
remaining 13 images are used as test images. Finally, a certain number of image 
pixels are randomly corrupted by a uniform distribution between [0,255], with 
the corruption percentage from 0% to 90%, as four examples shown in Figure 
6. In Table 1, the recognition rates between 50% and 70% pixel corruption are 
highlighted for more detailed comparison. 




Fig. 6: An aligned face image of Subject 1 in Multi-PIE, Session 1, under the ambient 
lighting condition (No. 0) is shown on the left. On the right, 20%, 40%, 60%, 
and 80% of image pixels are randomly selected and corrupted with uniformly 
distributed values in [0, 255], respectively. 

Fig. 7: Recovered face images (b — b—e) from Figure 6 using the CAB model (61) and 
Homotopy. Left: Reconstructed image from 20% pixels corruption. Middle 
Left: Reconstructed image from 40% pixels corruption. Middle Right: Re- 
constructed image from 60% pixels corruption. Right: Reconstructed image 
from 80% pixels corruption. 

We measure the performance of the algorithms in terms of the final recog- 
nition rate and the speed of execution. In choosing a proper stopping criterion, 
the stopping threshold is individually tuned for each algorithm to achieve the 
highest recognition rate. The results are shown in Figure 8. In addition, Figure 
7 shows the reconstructed face images after corruption compensation, namely, 
b = b e, given by the Homotopy solver. 



Tab. 1: Average recognition rates (in percentage) between 50% and 70% random pixel 
corruption on the multi-PIE database. The highest rates are in boldface. 



Corr. 


PDIPA 


L1LS 


Homotopy 


SpaRSA 


FISTA 


DALM 


50% 


99.8 


99.5 


99.8 


97.6 


96.2 


99.5 


60% 


98.6 


96.6 


98.7 


90.5 


86.8 


96.2 


70% 


84.1 


76.3 


84.6 


63.3 


58.7 


78.8 
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Fig. 8: Left: Average recognition accuracy (in percentage) on the Multi-PIE database. 
Right: Average run time (in second) on the Multi-PIE database. 



In terms of accuracy, Homotopy achieves the best overall performance. For 
instance, with 60% of the pixels randomly corrupted, its average recognition 
rate based on the CAB model is about 99%. The performance of PDIPA is very 
close to Homotopy, achieving the second best overall accuracy. On the other 
hand, FISTA obtains the lowest recognition rates, followed by SpaRSA. 

In terms of speed, Homotopy is also one of the fastest algorithm. In particu- 
lar, when the pixel corruption percentage is small, the sparsity of the coefficient 
vector w = [x; e] is very small. As Homotopy iteratively adds or removes 
nonzero coefficients one at a time, the algorithm can quickly terminate after 
just a few iterations. On the other hand, as w becomes dense when the pixel 
corruption percentage increases, the complexity of Homotopy increases superlin- 
early and becomes inefficient. It is also important to note that the speed of the 
two accelerated iterative-thresholding algorithms, namely, FISTA and DALM, 
does not increase significantly as the sparsity of w increases. 

It is more interesting to separately compare PDIPA, L1LS, and Homotopy, 
which provably solve the (Pi) problem, and SpaRSA, FISTA, and DALM, which 
essentially solve a relaxed version of (Pi). Overall, PDIPA, L1LS and Homotopy 
perform similarly in terms of the accuracy, consistent with our previous obser- 
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vations in the simulation. However, L1LS is significantly more expensive from a 
computation point of view to achieve the same recognition rates as Homotopy. 
Among the three iterative soft-thresholding methods, the experiments suggest 
that DALM is the most accurate. 

5 Face Recognition Experiment II: Face Alignment 

In this section, we benchmark the performance of the ^i-min algorithms in ro- 
bust face alignment within the sparse representation framework. While the 
previous section has demonstrated that solving the CAB model achieves state- 
of-the-art recognition accuracy on public datasets even when the test face images 
are severely corrupted or occluded, its success still relies on the assumption that 
the test images are well aligned with the training images. However, this assump- 
tion is not always satisfied in practice. As illustrated in Figure 9 top, a test face 
image obtained from an off-the-shelf face detector is likely to have some regis- 
tration error against the training images. The registration error may contribute 
to erroneous linear representation of the query image, even if sufficient illumi- 
nations are present in the training set. Fortunately, this alignment issue can 
be naturally addressed within the sparse representation framework. It has been 
shown in [46] that the face alignment problem can be solved by a series of linear 
problems that iteratively minimize the sparsity of the registration error, which 
leads to an effective algorithm that works for face images under a large range of 
variations in scale change, 2-D translation and rotation, and different 3-D poses. 
Furthermore, it inherits the robustness property from the sparse representation 
framework, and hence is also able to deal with faces that are partially occluded. 
Recall that a well-aligned test image b £ ~R d can be represented as a sparse 
linear combination Axq of all of the images in the database, plus a sparse error 
eo due to occlusion: 6o = Axq + en- Now suppose that bo is subject to some 
misalignment, so instead of observing 6o ; we observe the warped image 

b = b o t- 1 (62) 

for some transformation r £ T, where T is a finite-dimensional group of trans- 
formations acting on the image domain. As seen in Figure 9, directly seeking a 
sparse representation of b against the (well-aligned) training images is no longer 
appropriate. However, if the true deformation t" 1 can be efficiently found, then 
we can apply its inverse r to the test image, and it again becomes possible to 
find a good sparse representation of the resulting image: 

bor = Ax + e. (63) 

Naturally, we would like to use the sparsity as a cue for finding the correct 
deformation r. This can be formulated as the following optimization problem: 

min ||je||i + ||e||i subj to bo t = Ax + e. (64) 

x,e,r£T 

However, this is a difficult nonconvex optimization problem. Furthermore, due 
to the concern of local minima, directly solving (64) may simultaneously align 
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Fig. 9: Effect of registration. The task is to identify the girl among 20 sub- 
jects, by computing the sparse representation of her input face with 
respect to the entire training set. The absolute sum of the coefficients 
associated with each subject is plotted on the right. We also show the 
faces reconstructed with each subject's training images weighted by the 
associated sparse coefficients. The red line (cross) corresponds to her 
true identity, subject 12. Top: The face region is extracted by Viola 
and Jones' face detector (the black box). Bottom: Informative repre- 
sentation obtained by using a well-aligned face region (the white box). 
(Courtesy of [46]). 



the test image to different subjects in the database. Therefore, it is more ap- 
propriate to seek the best alignment with each subject i in the database [46]: 



arg mm 



subj to b o n = Aix + e. 



(65) 



In (65), ||a;||i is not penalized, since Ai <G M dx ™; only contains the images of 
subject i and x is no longer expected to be sparse. While (65) is still nonconvex, 
when a good initial estimation for the transformation is available, e.g., from the 
output of a face detector, one can refine this initialization to an estimate of the 
true transformation by repeatedly linearizing about the the current estimate of 
Tj. This idea leads to the final problem formulation as a convex optimization 
problem as follows: 



mm 

x,e,ATiST 



|e||i subj to bo Ti + JiAri = AiX + e. 



(66) 



Here, Ji = J^b ot; £ R dxqi is the Jacobian of b o r t with respect to the 
transformation parameters Ti, and Ar^ is the update direction w.r.t. r^. 

During each iteration k, the current alignment parameters t\ correct the 
observation as b 



(A') 



r (k) 



Denote B 



(k) 



[Ai,-Jf>] € R dx K+90 ail d 
w = [x T , Ar 2 T ] T , then the update At, can be computed by solving the following 
problem: 



mm | 

■w . e 



subj to 



b) = B) 'w 



(67) 
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Alignment experiment results, (a) Examples of perturbation. The green 
bounding boxes indicate the initial face location with some registration 
error; the red bounding boxes show the final position after face align- 
ment, (b) Average pixel error (lower the better), (c) Rate of Success 
(higher the better), (d) Average run time (lower the better). 



(d) 



Interested readers are referred to [46] for more details about the algorithm and 
comprehensive experimental studies. 

In this section, we benchmark the various £i-min algorithms on the CMU 
Multi-PIE face database. In order to solve (67), we had to modify the code 
for each algorithm so that ||a:||i is no longer penalized and to take care of the 
new constraint. To further speed up the algorithms, we take advantage of the 
fact that the matrix B is usually a tall matrix with much fewer columns than 
rows. For our experiments, the training set contains rii — 7 images per subject, 
and we choose the transformation group T to be the set of similarity transforms 
(therefore g, =4). So, the number of columns in B is just n.; + q. L = 11, while 
the number of rows is equal to the number of pixels in each image. 

Firstly, we modify the PDIPA to obtain a faster algorithm for this problem. 
We recall that PDIPA is computationally expensive mainly because at each 
iteration, we need to solve a linear system of equations the size of the dictionary 
[B,I] for (67). For instance, if d = 40 x 30 = 1200, it results in inverting a 
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matrix of size larger than 1200 x 1200. However, if we know that B has very few 
columns, it is not efficient to consider the dictionary [B,I] as a whole. Instead, 
we can explicitly write down the closed-form solution to the large linear system 
with B and / considered separately. This leads to a much smaller problem that 
only requires inverting a (rii + qi) x (rij + g,) = 11x11 matrix. Hence, the 
complexity of PDIPA is significantly reduced. We note that similar ideas can 
be also applied to speed up the other algorithms. In particular, when matrix B 
is typically an overdetermined matrix, the ALM algorithm that operates in the 
primal space, namely, PALM, is more efficient than DALM in the dual space. In 
the Appendix, modifications for three i^-min algorithms are briefly explained, 
which solve (67) based on GP, Homotopy, and 1ST approaches, respectively. 

The performance of the six ^i-min algorithms is again evaluated using the 
face images from the Multi-PIE database. In this experiment, the 249 subjects 
from the Session 1 are used. Out of 20 illuminations, {0, 1, 7, 13, 14, 16, 18} are 
chosen as the training images and the illumination 10 is used as the query im- 
age. All images are cropped and down-sampled to 40 x 30 pixels. Furthermore, 
the test images are manually perturbed up to 8 pixels along the x and y-ax.es 
in the canonical frame (with size 40 x 30) or up to 45° in-plane degree, respec- 
tively. Moreover, to test the robustness of the ^i-min algorithms to occlusion, 
we replace a random located block of size 10% of the face image with the image 
of a baboon. See Figure 10(a) for an example. 

The accuracy of each algorithm is measured by both the average pixel error 
and the rate of success. We consider an alignment successful if the difference 
between the final alignment is within 3 pixels of the ground truth in the origi- 
nal image frame (640 x 480 image size). As shown in Figure 10(b) and (c), in 
terms of alignment accuracy, PDIPA, PALM and L1LS achieve the best perfor- 
mance, with Homotopy and FISTA slightly worse. Meanwhile, the performance 
of SpaRSA degrades much faster than the others as the perturbation level in- 
creases. On the other hand, Figure 10(d) shows that L1LS, SpaRSA, and FISTA 
are more computational expensive. Overall, PALM is the fastest algorithm, 
which takes 25% to 50% less time compared to PDIPA and Homotopy. 

In conclusion, Both PALM and PDIPA performs very well for the alignment 
problem. Taking into account the implementation complexity and the fact that 
PALM is slightly faster than PDIPA, we conclude that PALM is the best choice 
for this problem. 

6 Conclusion and Discussion 

We have provided a comprehensive review of five fast ^i-min methods, i.e., 
gradient projection, homotopy, soft shrinkage-thresholding, proximal gradient, 
and augmented Lagrange multiplier. Through extensive experiments, we have 
shown that under a wide range of data conditions, there is no clear winner that 
always achieves the best performance in terms of both speed and accuracy. For 
perfect, noise-free data, the primal-dual interior point method (PDIPA) is more 
accurate than the other algorithms, albeit at a much lower speed. For noisy 
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data, first-order £i-min techniques (i.e., SpaRSA, FISTA, and ALM) are more 
efficient in recovering sparse signals in both the low-sparsity and high-sparsity 
regimes. We have also considered special cases of the £i-min problem where it 
is applied to robust face recognition. In this application, Homotopy and ALM 
achieve the highest recognition rates, and their computational cost is also the 
lowest compared to the other i i-min algorithms. 

All experiments described in this paper were carried out in MATLAB. The 
benchmarking in terms of computational speed may vary significantly when the 
algorithms are implemented in other programming languages such as C/CH — h 
The accuracy of different algorithms may also be affected by the precision of 
different floating-point /fixed-point arithmetic logic. To aid in peer evalution, 
all the experimental scripts and data have been made available on our website. 

7 Appendix 

In this section, we briefly discuss how to more efficiently solve the objective 
function (67) by taking advantage of the intrinsic structure of its dictionary 
[B, I]. For clarity, we will rewrite (67) using the following notation: 

min||e||i subj to b = Bw + e. (68) 

7.1 Solving Face Alignment via Gradient Projection 
Methods 

The objective function (68) can be optimized relatively straightforward follow- 
ing the gradient projection approach. In particular, using truncated Newton 
interior-point method, the objective function is rewritten as: 



mm 



±\\b-Bw-e\\ 2 + \j:Li^ ( 69 ) 

subj. to — m < e, < Ui, i = 1, • • • , d. (70) 

Then a logarithmic barrier to bound the domain of — u.i < e^ < Ui can be 
constructed as: 

$(e,w) = -^log(ui + ei)-^log(ui-ei)- (71) 

Over the domain of (w, e,u), the central path is calculated by minimizing 
the following convex function 

1 d 

F t (w,e,u) = t(-\\b - Bw - e|| 2 + X^u,) + $(e,u), (72) 



?=i 



where t is a nonnegative parameter. Then, the update direction (Aw, Ae, Am) 
is solved using Newton's method: 

"Aw" 

V 2 F t (w,e,u) ■ Ae = -VF t (w, e,u). (73) 

Am 
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7.2 Solving Face Alignment via Homotopy Methods 

In the Homotopy formulation, define a composite objective function: 

F(w,e) = -||6-Bw-e|| 2 + A||e||i (74) 

= f(w,e) + \g(e). (75) 

Setting the subgradient OF = <=> —df = Xdg, we obtain the following equali- 
ties: 

d 
X—g(e)=0 = B T (b-Bw-e) (76) 

and 

c(e) = X^-g(e) = (b-Bw-e). (77) 

oe 

In initialization, set e^ ' = and w — B^b. Since we know 9||e||i <G [— 1, 1], 
we initialize the maximal A using (77): 

A = max(abs(c)), (78) 

and the support I = {i : \a\ = A}. 

At each iteration k, first assuming w is fixed, the update direction for e is 

d^(X) = -X-sgn(c^(X)), (79) 

and the direction for the coefficients that are not in the support X is manually 

set to zero. Hence, 

e (fc+D =e « +7d « (80 ) 

In (80), a proper step size value 7 remains to be determined. In Homotopy 
methods, 7 is set as the minimal step size that leads to either a coefficient of c 
that is in the support X crosses zero, and hence should be removed from I\ or 
the magnitude of a coefficient of c outside the support X reaches or exceed A, 
and hence should be added to X. 

The first situation is easy to evaluate: 

7~ = min{-ej/d,}. (81) 

However, choosing 7+ for the second situation presents a problem, as in (77) 
any update in e does not change the values of c outside the support. In other 
words, due to the fact that the dictionary for e where its coefficients are sparse 
is indeed an identity matrix, the original strategy in Homotopy could not add 
any new indices into the support of e. An alternative solution that we use in our 
implementation is to determine 7 and w separately using the first constraint 
(76), and then iterate. 
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7.3 Solving Face Alignment via Iterative Soft-Thresholding 

The composite objective function F(w, e) = f(w , e) + Xg(e) in (75) can be also 
solved by iterative soft-thresholding (1ST) methods. To simplify the notation, 
denote y = [w; e] £ M. n+d . Then the objective function becomes: 

F(y)= 1 -\\b-[B,I]y\\ 2 + \\\e\\ 1 . (82) 

Then the update rule to minimize F(y) is computed by substitutting a linearized 
approximation of / as shown below: 

rdy (fc+1) = argmin{(y-y (fe) ) T V/(y (fc) ) + 
v 

^l|y-y (fe) ll 2 -v 2 /(y (fe) ) + A.g(e)}. (83 ) 

We further introduce an intermediate variable 

« (fc) = V (k) - J^y V/O/W), (84) 

where the hessian V 2 /(j/' fc - ) ) is approximated by the diagonal matrix aS k 'I. 
Then based on the 1ST approach, (83) has a closed-form solution as follows: 
For i < n, 

(fe+l) (fe+l) . \{y%- W» ) 2 1 (fe) /orx 

y\ = w\ = argmin <^ >=u\'\ (85) 



and for i > 



(fe+i) (fe+i) 

Vi = e i-n 



(yj-< k) ) 2 , Mvi\ 



= argmiiij, i ^— + -ffi f- ( 86 ) 

= 80ft(«i fc) ,^ 
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